TSM Remote Backup to BCRS Disaster Recovery

July 2, 2024
  • Objectives
    • Restore the customer workload to manageability when moving the workload to a different Site
    • Reintegrate VMs with possibly different IDs into the management structure all the way up to ISM
  • Use Cases
    • DR Test/Declaration to DR Site
    • Failback
  • Assumptions
    • The destination Site may already be running customer workload, or may have been initialized/prepped from scratch (as in a Site reconstruction)
    • The VMs to be “onboarded” already have entries in the Central Level (e.g., ISM) management tools
    • Priority ordering of management functionality restoration exists
      • Backup, monitoring, patching, billing, …
    • DR RTO spec does not apply to Restoration of Manageability
    • Customer workload has been started on the destination site (same assumption holds for tape-based and GM-based workload)
    • Customer VMs have been assigned Management IPs from the set-aside pool
    • Managing infrastructure is intact on destination Site
    • The RTM function will be orchestrated by the Site-Level DR Orchestrator
      • Automated and/or Manual means are acceptable
  • After customer workload has been restarted, restore the workload to full manageability
    • It is TBD whether the workload will be returned to customer control in a less than fully managed state, perhaps as an interim step while full manageability is achieved
  • This includes the following functions
    • Service catalog/portal (TSRM)
    • Service management [incident, problem, change, configuration, & asset management] (ISM, CCMDB, TADDM, TAMIT, TAD4D)
    • Usage accounting (TUAM)
    • Backup (TSM)
    • Monitoring (ITM, RTEMS, HTEMS, NetCool)
    • Storage Monitoring (TPC)
    • Reporting (TDW, SRM, GSMRT, GACDW)
    • Provisioning (TSAM/TPM)
    • Patching (PAE, TEM)
    • Infrastructure readiness (DNS, LDAP, etc.)
  • Install and register agents
    • ITM: Update agent configuration to reflect the changed IP address of the ITM server
    • Add to management group
    • Configure LDAP client (AIX & Linux)
    • Configure SSH keys & Active Directory access (VMware only)
    • Configure TSCM agent with address of TSCM server
    • Create and register TSM node
    • Configure TSM agent with IP address of TSM server
    • Install/configure TAD4D agent with IP address of TAD4D server
    • Install/configure TEM agent with IP address of TEM server (Note: unlike initial provisioning, do not perform patching)
  • Create resource record in the Netcool Database (for ticketing)
  • Create/adjust CIs in CMDB
  • Create/adjust asset record in ISM
  • Trigger TADDM discovery
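
The per-VM restoration steps above can be sketched as an ordered checklist with platform filters. This is a minimal illustration, not the Site-Level DR Orchestrator itself: the `VM` fields, the step wording, and `plan_for` are hypothetical names introduced here, and the split between operating system and hypervisor is an assumption about how "AIX & Linux" and "VMware only" would be distinguished.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class VM:
    name: str
    os: str          # hypothetical values: "aix", "linux", "windows"
    on_vmware: bool  # True when the VM runs on VMware


def always(vm: VM) -> bool:
    """Step applies to every VM."""
    return True


def aix_or_linux(vm: VM) -> bool:
    """LDAP client configuration is listed for AIX and Linux only."""
    return vm.os in ("aix", "linux")


def vmware_only(vm: VM) -> bool:
    """SSH keys and Active Directory access are listed for VMware only."""
    return vm.on_vmware


# Order follows the list above; each entry is (description, applicability test).
STEPS: List[Tuple[str, Callable[[VM], bool]]] = [
    ("Update ITM agent with the ITM server address", always),
    ("Add to management group", always),
    ("Configure LDAP client", aix_or_linux),
    ("Configure SSH keys and Active Directory access", vmware_only),
    ("Configure TSCM agent with the TSCM server address", always),
    ("Create and register TSM node", always),
    ("Configure TSM agent with the TSM server address", always),
    ("Install/configure TAD4D agent", always),
    ("Install/configure TEM agent (no patching)", always),
    ("Create resource record in the Netcool database", always),
    ("Create/adjust CIs in CMDB", always),
    ("Create/adjust asset record in ISM", always),
    ("Trigger TADDM discovery", always),
]


def plan_for(vm: VM) -> List[str]:
    """Return, in order, the step descriptions that apply to this VM."""
    return [description for description, applies in STEPS if applies(vm)]


if __name__ == "__main__":
    for line in plan_for(VM("db01", "aix", on_vmware=False)):
        print(line)
```

Whether each step is then executed automatically or handed to an operator is left open, in line with the assumption that automated and/or manual means are acceptable.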

Customer Requirements: 

  • Recover to BCRS location during loss of primary site
  • RPO / RTO = 48 / 72 hours
  • Customer selects the VMs that require DR
  • Separate DR contract signed with BCRS
  • Limited management capability in recovery site
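
As a worked example of the 48 / 72 hour objectives, the sketch below checks a DR event against both limits. The function name and the timestamps are hypothetical. It assumes that RPO is measured from the last backup replicated to the BCRS site to the declaration time, and that the RTO clock stops when customer workload is restored, since the assumptions state that the DR RTO spec does not apply to Restoration of Manageability.

```python
from datetime import datetime, timedelta
from typing import Tuple

RPO = timedelta(hours=48)  # maximum tolerated data-loss window
RTO = timedelta(hours=72)  # maximum tolerated time to restore workload


def meets_objectives(
    last_replicated_backup: datetime,
    disaster_declared: datetime,
    workload_restored: datetime,
) -> Tuple[bool, bool]:
    """Return (rpo_met, rto_met) for a single DR event."""
    data_loss_window = disaster_declared - last_replicated_backup
    recovery_time = workload_restored - disaster_declared
    return data_loss_window <= RPO, recovery_time <= RTO


if __name__ == "__main__":
    declared = datetime(2024, 7, 2, 8, 0)
    rpo_met, rto_met = meets_objectives(
        last_replicated_backup=declared - timedelta(hours=20),
        disaster_declared=declared,
        workload_restored=declared + timedelta(hours=80),
    )
    print(f"RPO met: {rpo_met}, RTO met: {rto_met}")
```

In this made-up event the last replicated backup is 20 hours old, so the RPO is met, but the workload comes back after 80 hours, so the RTO is missed.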

Solution:

  • Network connectivity established between the primary site and the BCRS site
  • TSM node replication to TSM server at BCRS site
  • TSM servers and storage established at BCRS site
  • VM configuration data made available to BCRS for recovery
  • Customer to provide network connectivity into BCRS center
  • Recover on shared or dedicated HW at BCRS center
  • Failback to the primary site after disaster / test
  • RTP to BCRS STF (Hot Site)
  • TSM Node Replication

TSM Remote Backup to BCRS Recovery Site

  • Primary site
    • Provision the primary VM as usual
    • Initiate backups and offsite vaulting at primary site
      • Tapes contain backups of system disk, registry, and data disks
  • Enable DR
    • The model is to provision each customer in a dedicated environment; the DR environment will not be multi-tenant
      • Convey VM Identity and Configuration to BCRS
      • BCRS to provide example inputs; E+ will automatically harvest and transmit them
    • Initiate TSM remote backups to BCRS for STS-DR-enabled workload
      • Validate with the IES OMT sponsor that IPsec Red-Red line-level encryption is acceptable
    • Configure networking at primary and DR site
      • Enable customer access at DR time (same customer IP numbers)
    • Also prepare for DR of essential management tools
      • AD, LDAP, DNS, Backup, Monitoring, others TBD


Primary site is declared failed by IBM

  • Secondary BCRS site has already been identified
    • Notification to BCRS that DR is beginning
  • Data Recovery
    • None required – backup data is already resident in TSM Servers at BCRS site
  • Workload Recovery
    • Images provisioned
      • BCRS Hot Site recovery BAU for STS, VSR Bronze for R1.3
      • Management tools first, then customer workload
    • TSM clients updated to point to the TSM server at the BCRS site
    • VM data restored from TSM server
    • VMs reconnected to customer production networking
    • Restore management continuity as appropriate
      • Management IP numbers may differ
      • Will re-provision management and IBR network
      • IBM admin team will manage VMs
  • Completion
    • Validation
    • Notification that DR is complete
    • How to handle customers that did not contract for DR?
  • Limited management activity
    • DNS, LDAP, AD, AAA, Backup, Monitoring…
    • E+ / IBM Team will provide admin services to customer workload
  • Daily backups continue
    • Full backups will already be in residence
    • A storage pool backup must be made on DR day one to obtain a full offsite copy
    • Encrypted daily offsite tapes will be provided
  • Performance may be reduced
  • Changes to VMs (other than those captured by backups) will be lost after failback
    • Only minimal changes allowed
    • Failback will include verification of compliance via modification of Rapid Migration
    • How long can customers live in the BCRS site?
    • Typical T’s and C’s allow 6 weeks, followed by a daily fee
  • Primary Site is still standing (e.g., power outage)
    • Failback to primary site, where images and management structure are intact
    • Start VMs from crash-consistent images in residence at primary site
    • Cases
      • VMs that were unchanged at the DR site can simply be restarted on the primary site
      • VMs that were changed at the DR site may have their changes lost
    • Restore VM data from remote TSM Server at BCRS Site
      • DR protected VMs will be restored from backups taken at BCRS site
      • Export tapes from BCRS site and reimport them at E+ site to seed recovery, then top off with electronic back-replication via TSM
      • Data Persistence VMs will be restored from crash disk image at Time of Disaster since they were not started at DR site
    • Re-establish manageability
    • Verify compliance
    • Verify workload operation
    • Re-establishment of customer network connectivity
    • Notification
    • Failback will incur an outage of TBD duration, which should be shorter than the DR outage
  • Primary Site is not standing (e.g., smoking hole)
    • Reconstruct the primary site (or build a new one)
    • Onboard customers to reconstructed site
    • Provision workload to reconstructed site
    • Restore VM data from remote TSM Server at BCRS Site
      • DR protected VMs will be restored from backups taken at BCRS site
      • Export tapes from BCRS site and reimport them at E+ site to seed recovery, then top off with electronic back-replication via TSM
      • Data Persistence VMs will be restored from crash disk image at Time of Disaster since they were not started at DR site
    • Re-establish manageability
      • Very different from non-smoking hole restoration
    • Verify compliance
    • Verify workload operation
    • Re-establishment of customer network connectivity
    • Notification
    • Failback will incur an outage of TBD duration, which should be shorter than the DR outage
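
The failback cases above can be read as a small decision rule per VM. The sketch below is one interpretation of those bullets, not a confirmed procedure: `RestorePath`, `failback_path`, and the three boolean inputs are hypothetical names, and it assumes that a VM which is not DR protected is a Data Persistence VM that was never started at the DR site.

```python
from enum import Enum


class RestorePath(Enum):
    RESTART_IN_PLACE = "restart from crash-consistent image at primary site"
    RESTORE_FROM_BCRS_TSM = "restore from backups taken at BCRS site"
    RESTORE_FROM_CRASH_IMAGE = "restore from crash disk image at time of disaster"


def failback_path(
    primary_site_intact: bool,
    dr_protected: bool,
    changed_at_dr: bool,
) -> RestorePath:
    """Pick a failback path for one VM, following the cases listed above."""
    if not dr_protected:
        # Data Persistence VMs were not started at the DR site, so the
        # crash disk image is the only copy to come back from.
        return RestorePath.RESTORE_FROM_CRASH_IMAGE
    if primary_site_intact and not changed_at_dr:
        # Unchanged VMs can simply be restarted on the standing primary site.
        return RestorePath.RESTART_IN_PLACE
    # Changed VMs, or any DR-protected VM when the primary site was rebuilt,
    # come back from the backups taken at the BCRS site.
    return RestorePath.RESTORE_FROM_BCRS_TSM


if __name__ == "__main__":
    print(failback_path(primary_site_intact=True, dr_protected=True, changed_at_dr=False).value)
    print(failback_path(primary_site_intact=False, dr_protected=True, changed_at_dr=True).value)
```

Changes made at the DR site that were not captured by backups are still lost on any of these paths, as noted in the list.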