Published: Nov 03, 2025
First‑of‑its‑kind multi‑region Databricks Disaster Recovery solution
In superannuation, trust is built over decades and can be shaken in moments.
For one of Australia’s largest funds, managing more than AUD $150 billion in retirement savings, trust depends on the certainty that the systems and data behind every decision and member interaction will always be available.
At the heart of the fund’s digital ecosystem sits a Databricks platform that powers compliance reporting, investment insights, and personalised digital experiences.
But there was a challenge: enterprise data platforms like Databricks integrate storage, ETL, reporting, and analytics, so a full Azure regional outage could disrupt all of these services at once. Native Databricks architecture offers high availability within a single region, but multi‑region failover and failback are not built in.
The fund required a disaster recovery solution that could:
- Operate across two regions
- Fail over to a secondary site in under 4 hours
- Fail back without breaking governance or losing data generated in the secondary site
- Be repeatable, proven, and fully tested
The business challenge
A prolonged disruption to the fund’s Databricks environment - for example, from a full Azure regional outage - could halt reporting cycles, delay investment decisions, and prevent members from accessing accounts.
Leadership needed certainty that:
- The Databricks environment could continue operating through multi‑region failover and failback without data loss
- Critical services could resume within a committed 4‑hour recovery window
- Resilience could be demonstrated through live simulation and documented for regulators
Co‑creating the solution
With no off‑the‑shelf option, NCS partnered with the fund’s technology and data teams to design a multi‑region disaster recovery model for Databricks on Azure.
Custom automation and scripting were developed to replicate Databricks services, metadata, data pipelines, workloads, and ingested data to a secondary region - enabling seamless failover and recovery from the last checkpoint.
Key elements:
- Dual Databricks deployment in geographically separate Azure regions
- Four‑hour recovery window for critical workloads
- Full failback capability without breaking governance or lineage
- Repeatable results proven through full‑scale live testing with no service disruption
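The article does not describe the custom automation itself, so the following is only a minimal sketch of the general idea of checkpoint-based replication in both directions. `Region`, `snapshot_checkpoint`, and `replicate` are hypothetical names operating on in-memory lists; they are not Databricks or Azure APIs. The sketch also assumes the primary was fully replicated before failover.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical in-memory stand-in for one region's tables (not a Databricks API)."""
    name: str
    tables: dict = field(default_factory=dict)  # table name -> ordered list of committed batches

    def commit(self, table: str, batch: str) -> None:
        self.tables.setdefault(table, []).append(batch)


def snapshot_checkpoint(region: Region) -> dict:
    """Record how many batches each table currently holds."""
    return {table: len(batches) for table, batches in region.tables.items()}


def replicate(source: Region, target: Region, checkpoint: dict) -> int:
    """Copy only batches committed after the checkpoint, then advance it."""
    copied = 0
    for table, batches in source.tables.items():
        for batch in batches[checkpoint.get(table, 0):]:
            target.commit(table, batch)
            copied += 1
        checkpoint[table] = len(batches)
    return copied


primary, secondary = Region("primary"), Region("secondary")
forward = {}  # checkpoint for primary -> secondary replication

primary.commit("members", "batch-1")
primary.commit("members", "batch-2")
replicate(primary, secondary, forward)

# Failover: the secondary becomes active; remember what it already shares with the primary.
reverse = snapshot_checkpoint(secondary)
secondary.commit("members", "batch-3")  # data generated while running in the secondary

# Failback: ship only the secondary-side delta back to the primary.
replicate(secondary, primary, reverse)
```

Keeping a separate checkpoint per direction is what lets failback carry over only the data generated in the secondary site, rather than recopying everything or overwriting it.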
The impact
The result is one of the most resilient Databricks deployments in the superannuation sector - and the first known multi‑region disaster recovery (DR) implementation for Databricks on Azure.
By extending Databricks’ robust platform capabilities, this solution delivers cross‑region replication, failover and failback for mission‑critical workloads in regulated environments.
Benefits:
- Business continuity during region‑level outages
- Regulatory confidence through documented DR readiness
- Member trust strengthened by uninterrupted service
The DR solution addressed:
- Cross‑region workspace deployment
- Metastore and data replication between regions
- Job and pipeline failover and failback
- Automated configuration replication to maintain governance and security
All components were validated through full‑scale live testing, proving the solution is repeatable and able to meet the 4‑hour recovery target without compromising data integrity.
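One way a drill result could be checked against the recovery target is sketched below. This is an illustration only: the step names loosely mirror the components listed above, the durations are invented placeholders rather than measured results from the fund's tests, and `evaluate_drill` is a hypothetical helper.

```python
from datetime import timedelta

RTO = timedelta(hours=4)  # recovery target stated in the article


def evaluate_drill(step_durations: dict[str, timedelta],
                   rto: timedelta = RTO) -> tuple[timedelta, bool]:
    """Sum the sequential step durations and report whether the total fits the RTO."""
    total = sum(step_durations.values(), timedelta())
    return total, total <= rto


# Placeholder durations for illustration only; not results from the actual drill.
example_drill = {
    "activate secondary workspace": timedelta(minutes=20),
    "confirm metastore and data replication": timedelta(minutes=90),
    "fail over jobs and pipelines": timedelta(minutes=45),
    "verify governance and security configuration": timedelta(minutes=30),
}

total, within_rto = evaluate_drill(example_drill)
print(f"Total recovery time: {total}, within RTO: {within_rto}")
```

The sketch assumes the steps run one after another; steps that run in parallel would need the longest path through the runbook rather than a simple sum.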
Why it matters
In superannuation, resilience is a competitive advantage. This multi‑region Databricks capability runs quietly in the background, giving leaders, regulators, and members confidence that their data foundation is secure - whatever happens next.