Cloud Infrastructure | Automation | Online Retail

Cross-Region Disaster Recovery & Data Replication

Designed and deployed a multi-region data replication system that automatically copies critical business data to a secondary region for disaster recovery — with encryption at both ends, intelligent cost management, and a multi-tier alerting system that escalates issues before they become outages.

Client: Enterprise Client

The challenge

What the client needed to fix

The client stored critical business data in one AWS region. If that region experienced an outage — however unlikely — the data would be inaccessible and operations would stop. For compliance and business continuity, they needed their data replicated to a second geographic region automatically, with guarantees that replication was actually working and that they'd know immediately if it fell behind.

They also needed the replicated data to be encrypted with region-specific keys (a compliance requirement), and they needed visibility into replication health without having to manually check dashboards every day. The system needed to alert the right people at the right severity level — a minor delay shouldn't wake up a manager, but a complete replication failure should escalate fast.

Cost was also a concern. The data would grow over time, and storing everything at the highest storage tier indefinitely would become expensive.

Process flow: how data replication protects your business

What we built

The solution we delivered

We built a cross-region replication system that automatically copies data from the primary region to a disaster recovery region the moment it's written. The system operates continuously with no manual intervention and includes comprehensive monitoring that escalates issues through three severity tiers.

Data is encrypted at both ends using region-specific encryption keys that rotate automatically every year. During transfer, the replication service decrypts each object with the source key and re-encrypts it with the destination key, so data is encrypted at rest in both regions and is never exposed unencrypted in transit.

To manage storage costs as data grows, we implemented intelligent lifecycle policies. Recent data stays on fast storage for immediate access. After 30 days, it moves to a cheaper tier. After 90 days, it moves to archive storage. After a year, it moves to deep archive. Old versions are cleaned up automatically after two years. This means the client only pays premium storage prices for data they're actively using.

The monitoring system operates across both regions with a tiered escalation model. Standard alerts (minor delays, temporary blips) notify the operations team via email. Critical alerts (replication failures, sustained latency) escalate to a dedicated critical channel. If both system health AND infrastructure health alarms fire simultaneously, it escalates to management — indicating a serious issue that needs immediate executive attention.

We also deployed threat detection and data classification services on the replicated data. The system continuously scans for unauthorised access attempts and automatically classifies sensitive data, feeding findings into a centralised compliance dashboard.

Technical detail

Under the hood

This section is for readers with a technical background who want to understand the architecture and implementation choices.

Architecture diagram: cross-region replication with encryption and monitoring

The infrastructure spans two AWS regions and is deployed as nested CloudFormation stacks — a source orchestration stack (us-east-1) and a destination orchestration stack (eu-west-2), each composed of 4-8 child stacks.

S3 Replication: A replication rule on the source bucket targets objects under a specific prefix. Versioning is enabled on both buckets (required for replication). The replication SLA is configurable (default: 15 minutes). An IAM replication role assumed by the S3 service handles cross-region trust, scoped with aws:SourceAccount conditions.
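As a rough illustration of the replication rule described above, the sketch below builds a dictionary in the shape of the S3 `ReplicationConfiguration` structure (the form accepted by boto3's `put_bucket_replication`). The ARNs, the rule ID, the `critical/` prefix, and the choice to leave delete-marker replication disabled are all placeholders and assumptions for illustration, not details of the delivered stacks.

```python
import json


def build_replication_config(role_arn, dest_bucket_arn, dest_key_arn,
                             prefix, sla_minutes=15):
    """Return a dict shaped like the S3 ReplicationConfiguration structure."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "dr-prefix-replication",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # only objects under this prefix
                # Assumption: delete markers are not replicated.
                "DeleteMarkerReplication": {"Status": "Disabled"},
                # Replicate objects that are encrypted with KMS at the source.
                "SourceSelectionCriteria": {
                    "SseKmsEncryptedObjects": {"Status": "Enabled"}
                },
                "Destination": {
                    "Bucket": dest_bucket_arn,
                    # Re-encrypt replicas with the destination-region key.
                    "EncryptionConfiguration": {"ReplicaKmsKeyID": dest_key_arn},
                    # Replication SLA and the matching metrics threshold.
                    "ReplicationTime": {
                        "Status": "Enabled",
                        "Time": {"Minutes": sla_minutes},
                    },
                    "Metrics": {
                        "Status": "Enabled",
                        "EventThreshold": {"Minutes": sla_minutes},
                    },
                },
            }
        ],
    }


# Placeholder identifiers, for illustration only.
config = build_replication_config(
    role_arn="arn:aws:iam::111122223333:role/example-replication-role",
    dest_bucket_arn="arn:aws:s3:::example-dr-bucket",
    dest_key_arn="arn:aws:kms:eu-west-2:111122223333:key/example-key-id",
    prefix="critical/",
)
print(json.dumps(config, indent=2))
```

Versioning must already be enabled on both buckets before a configuration like this is accepted. It is worth checking the current S3 documentation for which SLA values the service accepts.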

KMS Key Management: Symmetric CMKs in each region with automatic annual rotation. Key policies follow least-privilege: only the S3 service (via kms:ViaService condition) and the replication role can use the keys. The replication role has Decrypt on the source key and GenerateDataKey + Decrypt on the destination key.
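The key-policy grants described above could be expressed roughly as follows. This is a minimal sketch that only generates the statement for the replication role, using the actions named in the text and a `kms:ViaService` condition. The role ARN and statement ID are hypothetical, and a real key policy would also need administrative statements that are omitted here.

```python
import json


def replication_role_statement(role_arn, region, is_source):
    """Build one KMS key policy statement for the replication role.

    Source key: Decrypt only. Destination key: GenerateDataKey + Decrypt,
    as described in the text above.
    """
    if is_source:
        actions = ["kms:Decrypt"]
    else:
        actions = ["kms:GenerateDataKey", "kms:Decrypt"]
    return {
        "Sid": "AllowReplicationRoleViaS3",  # hypothetical statement ID
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": actions,
        "Resource": "*",  # in a key policy, "*" refers to the key itself
        # Only allow use of the key when the request comes through S3
        # in the key's own region.
        "Condition": {
            "StringEquals": {"kms:ViaService": f"s3.{region}.amazonaws.com"}
        },
    }


ROLE_ARN = "arn:aws:iam::111122223333:role/example-replication-role"  # placeholder

source_stmt = replication_role_statement(ROLE_ARN, "us-east-1", is_source=True)
dest_stmt = replication_role_statement(ROLE_ARN, "eu-west-2", is_source=False)
print(json.dumps({"Version": "2012-10-17", "Statement": [dest_stmt]}, indent=2))
```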

Lifecycle Policies (both buckets):
- Standard → Standard-IA after 30 days
- Standard-IA → Glacier after 90 days
- Glacier → Deep Archive after 365 days
- Non-current versions expire after 730 days
- Incomplete multipart uploads abort after 7 days
- Intelligent Tiering with Archive Access (90 days) and Deep Archive Access (180 days)
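The lifecycle schedule above maps fairly directly onto an S3 lifecycle configuration. The sketch below builds such a configuration as a plain dictionary (the shape accepted by boto3's `put_bucket_lifecycle_configuration`) and adds a small helper that reports which storage class an object of a given age would be expected to sit in. The rule ID and the helper are illustrative assumptions, and the Intelligent Tiering settings are left out for brevity.

```python
# (age in days since creation, target storage class), taken from the list above
TRANSITIONS = [
    (30, "STANDARD_IA"),
    (90, "GLACIER"),
    (365, "DEEP_ARCHIVE"),
]


def build_lifecycle_config():
    """Return a dict shaped like an S3 bucket lifecycle configuration."""
    return {
        "Rules": [
            {
                "ID": "tiering-and-cleanup",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix applies to all objects
                "Transitions": [
                    {"Days": days, "StorageClass": storage_class}
                    for days, storage_class in TRANSITIONS
                ],
                "NoncurrentVersionExpiration": {"NoncurrentDays": 730},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }


def storage_class_for_age(age_days):
    """Approximate the storage class of an object that is age_days old."""
    current = "STANDARD"
    for days, storage_class in TRANSITIONS:
        if age_days >= days:
            current = storage_class
    return current


lifecycle = build_lifecycle_config()
for age in (0, 45, 120, 400):
    print(age, storage_class_for_age(age))
```

The helper is only an approximation: S3 applies transitions asynchronously, so an object may change class some time after it crosses a threshold.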

Monitoring (Destination Region):
- Replication Failure alarm (OperationsFailedReplication > 0, 2 eval periods)
- Replication Latency Warning (> 30 min) and Critical (> 1 hour)
- Pending Operations (> 100, 3 eval periods)
- Bytes Pending (> 1 GB, 3 eval periods)
- Destination Bucket 4xx Errors (> 10)
- KMS Destination Failures (> 5)
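To make the replication thresholds above concrete, here is a hedged sketch that turns them into CloudWatch alarm definitions, expressed as keyword arguments in the shape of boto3's `put_metric_alarm`. The metric names are the standard S3 replication metrics in the `AWS/S3` namespace. The alarm names, statistics, 5-minute period, missing-data handling, the single evaluation period for the latency alarms, and the reading of 1 GB as 1024³ bytes are all assumptions. The 4xx and KMS alarms are omitted.

```python
# (name suffix, metric, statistic, threshold, evaluation periods)
ALARM_SPECS = [
    ("replication-failure", "OperationsFailedReplication", "Sum", 0, 2),
    # ReplicationLatency is reported in seconds.
    ("latency-warning", "ReplicationLatency", "Maximum", 30 * 60, 1),
    ("latency-critical", "ReplicationLatency", "Maximum", 60 * 60, 1),
    ("pending-operations", "OperationsPendingReplication", "Maximum", 100, 3),
    ("bytes-pending", "BytesPendingReplication", "Maximum", 1024 ** 3, 3),
]


def build_alarms(source_bucket, dest_bucket, rule_id, topic_arn, period=300):
    """Return a list of dicts usable as put_metric_alarm keyword arguments."""
    dimensions = [
        {"Name": "SourceBucket", "Value": source_bucket},
        {"Name": "DestinationBucket", "Value": dest_bucket},
        {"Name": "RuleId", "Value": rule_id},
    ]
    alarms = []
    for suffix, metric, statistic, threshold, periods in ALARM_SPECS:
        alarms.append({
            "AlarmName": f"{dest_bucket}-{suffix}",  # hypothetical naming scheme
            "Namespace": "AWS/S3",
            "MetricName": metric,
            "Dimensions": dimensions,
            "Statistic": statistic,
            "Period": period,  # seconds per evaluation period (assumed)
            "EvaluationPeriods": periods,
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanThreshold",
            "TreatMissingData": "notBreaching",  # assumed
            "AlarmActions": [topic_arn],
        })
    return alarms


alarms = build_alarms(
    source_bucket="example-source-bucket",  # placeholder names
    dest_bucket="example-dr-bucket",
    rule_id="dr-prefix-replication",
    topic_arn="arn:aws:sns:eu-west-2:111122223333:example-ops-topic",
)
for alarm in alarms:
    print(alarm["AlarmName"], alarm["MetricName"], alarm["Threshold"])
```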

Monitoring (Source Region):
- Replication Latency exceeds SLA
- Source Bucket Object Count (≤ 0 indicates data loss)
- Source Bucket Size Growth (> 1 TB warning)
- Source Bucket 4xx and 5xx Errors
- Source KMS Key Usage throttling

Composite Alarms & Escalation:
- SystemHealthCompositeAlarm = Replication Failure OR Critical Latency → publishes to Critical Escalation Topic
- InfrastructureHealthCompositeAlarm = S3 Errors OR KMS Failures
- EscalationCompositeAlarm = System Health AND Infrastructure Health both in ALARM → publishes to Management Escalation Topic
- Cross-region EventBridge rule forwards critical alarm state changes back to the source region for visibility
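The escalation logic above can be sketched in two parts: the rule expressions a CloudWatch composite alarm would carry (using the `ALARM("name")` rule syntax), and a small pure-Python model of which topics would be notified for a given set of alarm states. The child alarm names are hypothetical stand-ins for the metric alarms listed earlier. Only the three composite alarm names come from the description above.

```python
def any_in_alarm(*names):
    """Composite rule that fires when at least one named alarm is in ALARM."""
    return " OR ".join(f'ALARM("{name}")' for name in names)


def all_in_alarm(*names):
    """Composite rule that fires only when every named alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in names)


# Hypothetical child alarm names.
SYSTEM_CHILDREN = ("replication-failure", "latency-critical")
INFRA_CHILDREN = ("s3-errors", "kms-failures")

SYSTEM_RULE = any_in_alarm(*SYSTEM_CHILDREN)
INFRA_RULE = any_in_alarm(*INFRA_CHILDREN)
ESCALATION_RULE = all_in_alarm(
    "SystemHealthCompositeAlarm", "InfrastructureHealthCompositeAlarm"
)


def escalation_topics(states):
    """Model which topics are notified.

    states maps a child alarm name to True when that alarm is in ALARM.
    """
    system = any(states.get(name, False) for name in SYSTEM_CHILDREN)
    infra = any(states.get(name, False) for name in INFRA_CHILDREN)
    topics = []
    if system:
        topics.append("critical")
    if system and infra:
        topics.append("management")
    return topics


print(SYSTEM_RULE)
print(ESCALATION_RULE)
print(escalation_topics({"replication-failure": True, "kms-failures": True}))
```

Nesting the two health alarms inside a third composite alarm keeps the management tier quiet unless both kinds of failure occur together, which matches the intent described above.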

Security Services (destination region):
- Amazon GuardDuty: continuous threat detection on the destination bucket
- Amazon Macie: automated sensitive data classification
- AWS CloudTrail: audit logging for all S3 and KMS API calls
- All findings feed into AWS Security Hub for centralised compliance posture

Dashboards: Two CloudWatch dashboards — a Primary Destination Dashboard (multi-region overview with replication latency, pending operations, bytes pending, storage class distribution, KMS usage, error rates) and a Regional Dashboard (destination-focused metrics).

The results

What changed for the client

Critical business data is automatically replicated to a second geographic region within 15 minutes
Data is encrypted at rest in both regions with automatically-rotating keys — meeting compliance requirements
Storage costs are managed automatically — data moves to cheaper tiers as it ages
Three-tier alerting ensures the right people are notified at the right severity level
Threat detection and data classification run continuously on replicated data
Two monitoring dashboards provide real-time visibility into replication health across both regions
The system operates continuously with zero manual intervention
Full audit trail of all data access and replication operations for compliance reporting

2

Geographic regions

<15 min

Replication SLA

3-tier

Alert escalation

30+

Monitoring alarms
