Designed and deployed a multi-region data replication system that automatically copies critical business data to a secondary region for disaster recovery — with encryption at both ends, intelligent cost management, and a multi-tier alerting system that escalates issues before they become outages.
Client: Enterprise Client
The challenge
The client stored critical business data in one AWS region. If that region experienced an outage — however unlikely — the data would be inaccessible and operations would stop. For compliance and business continuity, they needed their data replicated to a second geographic region automatically, with guarantees that replication was actually working and that they'd know immediately if it fell behind.
They also needed the replicated data to be encrypted with region-specific keys (a compliance requirement), and they needed visibility into replication health without having to manually check dashboards every day. The system needed to alert the right people at the right severity level — a minor delay shouldn't wake up a manager, but a complete replication failure should escalate fast.
Cost was also a concern. The data would grow over time, and storing everything at the highest storage tier indefinitely would become expensive.
Architecture diagram: cross-region replication with encryption and monitoring
What we built
We built a cross-region replication system that automatically copies data from the primary region to a disaster recovery region the moment it's written. The system operates continuously with no manual intervention and includes comprehensive monitoring that escalates issues through three severity tiers.
Data is encrypted at both ends using region-specific encryption keys that rotate automatically every year. During replication, the service decrypts each object with the source key and re-encrypts it with the destination key, so data is always encrypted at rest in both regions and stays encrypted in transit between them.
To manage storage costs as data grows, we implemented intelligent lifecycle policies. Recent data stays on fast storage for immediate access. After 30 days, it moves to a cheaper tier. After 90 days, it moves to archive storage. After a year, it moves to deep archive. Old versions are cleaned up automatically after two years. This means the client only pays premium storage prices for data they're actively using.
The monitoring system operates across both regions with a tiered escalation model. Standard alerts (minor delays, temporary blips) notify the operations team via email. Critical alerts (replication failures, sustained latency) escalate to a dedicated critical channel. If the system health and infrastructure health alarms are both firing at the same time, the issue escalates to management, because that combination indicates a serious problem that needs immediate executive attention.
We also deployed threat detection and data classification services on the replicated data. The system continuously scans for unauthorised access attempts and automatically classifies sensitive data, feeding findings into a centralised compliance dashboard.
CloudWatch dashboard: replication latency, pending operations, and throughput
SNS escalation topology: standard → critical → management tiers
Technical detail
This section is for readers with a technical background who want to understand the architecture and implementation choices.
The infrastructure spans two AWS regions and is deployed as nested CloudFormation stacks — a source orchestration stack (us-east-1) and a destination orchestration stack (eu-west-2), each composed of 4-8 child stacks.
S3 Replication: A replication rule on the source bucket targets objects under a specific prefix. Versioning is enabled on both buckets (required for replication). The replication SLA is configurable (default: 15 minutes). An IAM replication role assumed by the S3 service handles cross-region trust, scoped with aws:SourceAccount conditions.
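As a rough illustration of the replication rule described above, the sketch below builds a configuration dict in the shape accepted by boto3's `put_bucket_replication`. It is not the deployed CloudFormation template. The ARNs, the `critical/` prefix, and the rule ID are hypothetical, and enabling Replication Time Control with a 15-minute threshold is an assumption based on the stated SLA.

```python
import json


def build_replication_config(role_arn, dest_bucket_arn, dest_kms_key_arn,
                             prefix, rule_id="dr-replication"):
    """Return a dict shaped for s3.put_bucket_replication(ReplicationConfiguration=...)."""
    return {
        "Role": role_arn,  # IAM role that the S3 service assumes
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                # Only objects under this prefix are replicated.
                "Filter": {"Prefix": prefix},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                # Opt in to replicating objects encrypted with SSE-KMS.
                "SourceSelectionCriteria": {
                    "SseKmsEncryptedObjects": {"Status": "Enabled"}
                },
                "Destination": {
                    "Bucket": dest_bucket_arn,
                    # Replicas are re-encrypted with the destination-region key.
                    "EncryptionConfiguration": {
                        "ReplicaKmsKeyID": dest_kms_key_arn
                    },
                    # Assumption: the 15-minute SLA maps to Replication Time Control.
                    "ReplicationTime": {
                        "Status": "Enabled",
                        "Time": {"Minutes": 15},
                    },
                    "Metrics": {
                        "Status": "Enabled",
                        "EventThreshold": {"Minutes": 15},
                    },
                },
            }
        ],
    }


# Hypothetical identifiers, for illustration only.
config = build_replication_config(
    role_arn="arn:aws:iam::111122223333:role/example-replication-role",
    dest_bucket_arn="arn:aws:s3:::example-dr-bucket",
    dest_kms_key_arn="arn:aws:kms:eu-west-2:111122223333:key/example-key-id",
    prefix="critical/",
)
print(json.dumps(config, indent=2))
```

Versioning must already be enabled on both buckets before a configuration like this can be applied to the source bucket.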
KMS Key Management: Symmetric CMKs in each region with automatic annual rotation. Key policies follow least-privilege: only the S3 service (via kms:ViaService condition) and the replication role can use the keys. The replication role has Decrypt on the source key and GenerateDataKey + Decrypt on the destination key.
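To make the key-policy scoping more concrete, here is a minimal sketch of the statement that might grant the replication role access to a key only when the request arrives through S3. The role ARN and statement IDs are hypothetical, and a real key policy would also need an administrative statement, which is omitted here.

```python
import json


def replication_key_statement(sid, replication_role_arn, region, actions):
    """Build one key-policy statement limited to calls made via S3 in `region`."""
    return {
        "Sid": sid,
        "Effect": "Allow",
        "Principal": {"AWS": replication_role_arn},
        "Action": list(actions),
        "Resource": "*",  # in a key policy, "*" refers to the key itself
        "Condition": {
            # Deny direct KMS calls: the request must come through S3.
            "StringEquals": {"kms:ViaService": f"s3.{region}.amazonaws.com"}
        },
    }


# Hypothetical role ARN, for illustration only.
ROLE_ARN = "arn:aws:iam::111122223333:role/example-replication-role"

# Source key: the role only needs to decrypt objects being replicated.
source_statement = replication_key_statement(
    "AllowReplicationDecrypt", ROLE_ARN, "us-east-1", ["kms:Decrypt"]
)

# Destination key: permissions as listed in the text above.
destination_statement = replication_key_statement(
    "AllowReplicationEncrypt", ROLE_ARN, "eu-west-2",
    ["kms:GenerateDataKey", "kms:Decrypt"],
)

print(json.dumps([source_statement, destination_statement], indent=2))
```

Annual rotation is a separate setting on the key rather than part of the policy; in boto3 it can be turned on with `enable_key_rotation`.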
Lifecycle Policies (both buckets):
- Standard → Standard-IA after 30 days
- Standard-IA → Glacier after 90 days
- Glacier → Deep Archive after 365 days
- Non-current versions expire after 730 days
- Incomplete multipart uploads abort after 7 days
- Intelligent Tiering with Archive Access (90 days) and Deep Archive Access (180 days)
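The transition schedule above could be expressed as a lifecycle configuration along these lines. This is a sketch in the dict shape used by boto3's `put_bucket_lifecycle_configuration`; the rule ID is hypothetical, and the Intelligent-Tiering archive settings are left out because they are configured through a separate API. The helper `storage_class_at` is illustrative only and simply reads the schedule back.

```python
def build_lifecycle_config(prefix=""):
    """Return a dict shaped for s3.put_bucket_lifecycle_configuration."""
    return {
        "Rules": [
            {
                "ID": "tiering-and-cleanup",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix = whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "NoncurrentVersionExpiration": {"NoncurrentDays": 730},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }


def storage_class_at(age_days, config):
    """Storage class a current object version is scheduled to be in at a given age."""
    current = "STANDARD"
    transitions = sorted(config["Rules"][0]["Transitions"], key=lambda t: t["Days"])
    for transition in transitions:
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


lifecycle = build_lifecycle_config()
for age in (0, 30, 90, 365):
    print(age, storage_class_at(age, lifecycle))
```

Keeping the schedule in one data structure means the same values can drive both the deployed rule and any cost projections built on top of it.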
Monitoring (Destination Region):
- Replication Failure alarm (OperationsFailedReplication > 0, 2 eval periods)
- Replication Latency Warning (> 30 min) and Critical (> 1 hour)
- Pending Operations (> 100, 3 eval periods)
- Bytes Pending (> 1 GB, 3 eval periods)
- Destination Bucket 4xx Errors (> 10)
- KMS Destination Failures (> 5)
Monitoring (Source Region):
- Replication Latency exceeds SLA
- Source Bucket Object Count (≤ 0 indicates data loss)
- Source Bucket Size Growth (> 1 TB warning)
- Source Bucket 4xx and 5xx Errors
- Source KMS Key Usage throttling
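The replication alarms listed above share a common shape, which the sketch below captures as parameter dicts for boto3's `put_metric_alarm`. Metric names and dimensions are the standard S3 replication metrics in the `AWS/S3` namespace. The alarm names, bucket names, 5-minute period, `Maximum` statistic, treatment of missing data, and the single evaluation period on the latency alarms are assumptions, as is reading "1 GB" as 10^9 bytes.

```python
def replication_alarm(name, metric, threshold, eval_periods,
                      source_bucket, dest_bucket, rule_id,
                      statistic="Maximum", period=300):
    """Return kwargs shaped for cloudwatch.put_metric_alarm(**kwargs)."""
    return {
        "AlarmName": name,
        "Namespace": "AWS/S3",
        "MetricName": metric,
        "Dimensions": [
            {"Name": "SourceBucket", "Value": source_bucket},
            {"Name": "DestinationBucket", "Value": dest_bucket},
            {"Name": "RuleId", "Value": rule_id},
        ],
        "Statistic": statistic,
        "Period": period,  # seconds per evaluation window (assumed)
        "EvaluationPeriods": eval_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumed: quiet periods are healthy
    }


# Hypothetical bucket and rule identifiers.
BUCKETS = dict(source_bucket="example-primary-bucket",
               dest_bucket="example-dr-bucket",
               rule_id="dr-replication")

alarms = [
    replication_alarm("ReplicationFailure", "OperationsFailedReplication",
                      0, 2, statistic="Sum", **BUCKETS),
    # ReplicationLatency is reported in seconds.
    replication_alarm("ReplicationLatencyWarning", "ReplicationLatency",
                      30 * 60, 1, **BUCKETS),
    replication_alarm("ReplicationLatencyCritical", "ReplicationLatency",
                      60 * 60, 1, **BUCKETS),
    replication_alarm("PendingOperations", "OperationsPendingReplication",
                      100, 3, **BUCKETS),
    replication_alarm("BytesPending", "BytesPendingReplication",
                      10**9, 3, **BUCKETS),
]

for alarm in alarms:
    print(alarm["AlarmName"], alarm["MetricName"], alarm["Threshold"])
```

The bucket error, object count, and KMS alarms would follow the same pattern with different metrics and dimensions.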
Composite Alarms & Escalation:
- SystemHealthCompositeAlarm = Replication Failure OR Critical Latency → publishes to Critical Escalation Topic
- InfrastructureHealthCompositeAlarm = S3 Errors OR KMS Failures
- EscalationCompositeAlarm = System Health AND Infrastructure Health both in ALARM → publishes to Management Escalation Topic
- Cross-region EventBridge rule forwards critical alarm state changes back to the source region for visibility
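The escalation logic above maps naturally onto CloudWatch composite alarm rule expressions. The sketch below assembles those expressions and the parameters for boto3's `put_composite_alarm`. The three composite alarm names come from the text; the child alarm names and SNS topic ARNs are hypothetical.

```python
def alarm(name):
    """Rule term that is true while the named alarm is in ALARM state."""
    return f'ALARM("{name}")'


def any_of(*terms):
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms):
    return "(" + " AND ".join(terms) + ")"


def composite(name, rule, topic_arn=None):
    """Return kwargs shaped for cloudwatch.put_composite_alarm(**kwargs)."""
    params = {"AlarmName": name, "AlarmRule": rule}
    if topic_arn:
        params["AlarmActions"] = [topic_arn]
    return params


# Hypothetical SNS topic ARNs.
CRITICAL_TOPIC = "arn:aws:sns:eu-west-2:111122223333:critical-escalation"
MANAGEMENT_TOPIC = "arn:aws:sns:eu-west-2:111122223333:management-escalation"

system_health = composite(
    "SystemHealthCompositeAlarm",
    any_of(alarm("ReplicationFailure"), alarm("ReplicationLatencyCritical")),
    CRITICAL_TOPIC,
)

# No notification of its own; it only feeds the escalation alarm.
infrastructure_health = composite(
    "InfrastructureHealthCompositeAlarm",
    any_of(alarm("DestinationBucket4xxErrors"), alarm("KmsDestinationFailures")),
)

escalation = composite(
    "EscalationCompositeAlarm",
    all_of(alarm("SystemHealthCompositeAlarm"),
           alarm("InfrastructureHealthCompositeAlarm")),
    MANAGEMENT_TOPIC,
)

for c in (system_health, infrastructure_health, escalation):
    print(c["AlarmName"], "=", c["AlarmRule"])
```

Layering the escalation alarm on top of the two health composites keeps the management tier silent unless both the replication path and the underlying services are unhealthy at once.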
Security Services (destination region):
- Amazon GuardDuty: continuous threat detection on the destination bucket
- Amazon Macie: automated sensitive data classification
- AWS CloudTrail: audit logging for all S3 and KMS API calls
- All findings feed into AWS Security Hub for centralised compliance posture
Dashboards: Two CloudWatch dashboards — a Primary Destination Dashboard (multi-region overview with replication latency, pending operations, bytes pending, storage class distribution, KMS usage, error rates) and a Regional Dashboard (destination-focused metrics).
The results
Security Hub: compliance posture and threat detection findings
Interested in something similar?
Book a free 30-minute discovery call. We'll listen to what you need, tell you what's realistic, and give you a straight answer on whether we can help.